A Study of Semi-discrete Matrix Decomposition for LSI in Automated Text Categorization
نویسندگان
چکیده
This paper proposes the use of Latent Semantic Indexing (LSI) techniques, decomposed with semi-discrete matrix decomposition (SDD) method, for text categorization. The SDD algorithm is a recent solution to LSI, which can achieve similar performance at a much lower storage cost. In this paper, LSI is used for text categorization by constructing new features of category as combinations or transformations of the original features. In the experiments on data set of Chinese Library Classification we compare accuracy to a classifier based on k-Nearest Neighbor (k-NN) and the result shows that k-NN based on LSI is sometimes significantly better. Much future work remains, but the results indicate that LSI is a promising technique for text categorization.
منابع مشابه
Information Retrieval System in Bahasa Indonesia Using Latent Semantic Indexing and Semi-Discrete Matrix Decomposition
The focus of this paper is exploring the use of Latent Semantic Indexing (LSI) and Semi-Discrete Matrix Decomposition (SDD) in Bahasa Indonesia Information Retrieval System. The method is to take advantage of implicit higher-order structure in association of terms with document (" semantic structure ") in order to improve the detection of relevant document on the basis of terms found in queries...
متن کاملUsing Web Searches on Important Words to Create Background Sets for LSI Classification
The world wide web has a wealth of information that is related to almost any text classification task. This paper presents a method for mining the web to improve text classification, by creating a background text set. Our algorithm uses the information gain criterion to create lists of important words for each class of a text categorization problem. It then searches the web on various combinati...
متن کاملImproving Methods for Single-label Text Categorization
As the volume of information in digital form increases, the use of Text Categorization techniques aimed at finding relevant information becomes more necessary. To improve the quality of the classification, I propose the combination of different classification methods. The results show that k-NN-LSI, the combination of k-NNwith LSI, presents an average Accuracy on the five datasets that is highe...
متن کاملText Categorization and Information Retrieval Using WordNet Senses
In this paper we study the influence of semantics in the Text Categorization (TC) and Information Retrieval (IR) tasks. The K Nearest Neighbours (K-NN) method was used to perform the text categorization. The experimental results were obtained taking into account for a relevant term of a document its corresponding WordNet synset. For the IR task, three techniques were investigated: the direct us...
متن کاملAutomated Gene Classification using Nonnegative Matrix Factorization on Biomedical Literature
Understanding functional gene relationships is a challenging problem for biological applications. High-throughput technologies such as DNA microarrays have inundated biologists with a wealth of information, however, processing that information remains problematic. To help with this problem, researchers have begun applying text mining techniques to the biological literature. This work extends pr...
متن کامل